
Newton's Method

Multi-Dimensional Newton's Method

Recall the one-dimensional Newton's method presented in One-Dimensional Search Methods. Similarly, in the multi-dimensional setting, we minimize a quadratic approximation of the objective function and take its minimizer as the next iterate, as illustrated in Figure 9.1.

We can obtain a quadratic approximation to a twice continuously differentiable objective function $f:\mathbb{R}^n\to\mathbb{R}$ using the Taylor series expansion of $f$ about the current point $x^{(k)}$, neglecting terms of order three and higher. We obtain

$$f(x) \approx f(x^{(k)})+(x-x^{(k)})^Tg^{(k)}+{1\over2}(x-x^{(k)})^TF(x^{(k)})(x-x^{(k)}) \triangleq q(x),$$

where, for simplicity, we use the notation $g^{(k)}=\nabla f(x^{(k)})$, and $F(x^{(k)})$ denotes the Hessian of $f$ at $x^{(k)}$. Applying the FONC (first-order necessary condition) to $q$ yields

$$0=\nabla q(x)=g^{(k)}+F(x^{(k)})(x-x^{(k)}).$$

If $F(x^{(k)})>0$, then $q$ achieves a minimum at

$$x^{(k+1)}=x^{(k)}-F(x^{(k)})^{-1}g^{(k)}.$$

This recursive formula represents Newton's method.
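As a concrete illustration, the recursion can be sketched in a few lines of NumPy. This is a minimal sketch, not a robust implementation; the function names (`newton`, `grad`, `hess`) and the test function $f(x)=(x_1-1)^4+(x_2+2)^2$ are illustrative choices, and we solve $F(x^{(k)})d=-g^{(k)}$ rather than forming the inverse explicitly.

```python
import numpy as np

def newton(f_grad, f_hess, x0, tol=1e-8, max_iter=50):
    """Multi-dimensional Newton's method: x <- x - F(x)^{-1} g(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = f_grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve F(x) d = -g instead of computing F(x)^{-1} explicitly.
        d = np.linalg.solve(f_hess(x), -g)
        x = x + d
    return x

# Illustrative example: f(x) = (x1 - 1)^4 + (x2 + 2)^2, minimizer at (1, -2).
grad = lambda x: np.array([4 * (x[0] - 1) ** 3, 2 * (x[1] + 2)])
hess = lambda x: np.array([[12 * (x[0] - 1) ** 2, 0.0], [0.0, 2.0]])
x_star = newton(grad, hess, [3.0, 3.0])
```

Note that on the quartic term Newton's method converges only linearly (each step shrinks the error by a factor of $2/3$), while the quadratic term is minimized exactly in one step.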

Quasi-Newton Methods

A computational drawback of Newton's method is the need to evaluate $\boldsymbol{F}(\boldsymbol{x}^{(k)})$ and solve the equation $\boldsymbol{F}(\boldsymbol{x}^{(k)}) \boldsymbol{d}^{(k)}=-\boldsymbol{g}^{(k)}$ [i.e., compute $\boldsymbol{d}^{(k)}=-\boldsymbol{F}(\boldsymbol{x}^{(k)})^{-1} \boldsymbol{g}^{(k)}$]. To avoid the computation of $\boldsymbol{F}(\boldsymbol{x}^{(k)})^{-1}$, quasi-Newton methods use an approximation to $\boldsymbol{F}(\boldsymbol{x}^{(k)})^{-1}$ in place of the true inverse. This approximation is updated at every stage so that it exhibits at least some properties of $\boldsymbol{F}(\boldsymbol{x}^{(k)})^{-1}$. To get some idea of the properties that an approximation to $\boldsymbol{F}(\boldsymbol{x}^{(k)})^{-1}$ should satisfy, consider the formula

$$\boldsymbol{x}^{(k+1)}=\boldsymbol{x}^{(k)}-\alpha \boldsymbol{H}_k \boldsymbol{g}^{(k)},$$

where $\boldsymbol{H}_k$ is an $n \times n$ real matrix and $\alpha$ is a positive search parameter. Expanding $f$ about $\boldsymbol{x}^{(k)}$ yields

$$
\begin{aligned}
f\left(\boldsymbol{x}^{(k+1)}\right) &=f\left(\boldsymbol{x}^{(k)}\right)+\boldsymbol{g}^{(k) \top}\left(\boldsymbol{x}^{(k+1)}-\boldsymbol{x}^{(k)}\right)+o\left(\left\|\boldsymbol{x}^{(k+1)}-\boldsymbol{x}^{(k)}\right\|\right) \\
&=f\left(\boldsymbol{x}^{(k)}\right)-\alpha \boldsymbol{g}^{(k) \top} \boldsymbol{H}_k \boldsymbol{g}^{(k)}+o\left(\left\|\boldsymbol{H}_k \boldsymbol{g}^{(k)}\right\| \alpha\right) .
\end{aligned}
$$

As $\alpha$ tends to zero, the second term on the right-hand side of this equation dominates the third. Thus, to guarantee a decrease in $f$ for small $\alpha$, we must have

$$\boldsymbol{g}^{(k) \top} \boldsymbol{H}_k \boldsymbol{g}^{(k)}>0 .$$

A simple way to ensure this is to require that $\boldsymbol{H}_k$ be positive definite. We have proved the following result.

Proposition

Let $f \in \mathcal{C}^1$, $\boldsymbol{x}^{(k)} \in \mathbb{R}^n$, $\boldsymbol{g}^{(k)}=\nabla f(\boldsymbol{x}^{(k)}) \neq \boldsymbol{0}$, and let $\boldsymbol{H}_k$ be an $n \times n$ real symmetric positive definite matrix. If we set $\boldsymbol{x}^{(k+1)}=\boldsymbol{x}^{(k)}-\alpha_k \boldsymbol{H}_k \boldsymbol{g}^{(k)}$, where $\alpha_k=\arg \min_{\alpha \geq 0} f(\boldsymbol{x}^{(k)}-\alpha \boldsymbol{H}_k \boldsymbol{g}^{(k)})$, then $\alpha_k>0$ and $f(\boldsymbol{x}^{(k+1)})<f(\boldsymbol{x}^{(k)})$.

In constructing an approximation to the inverse of the Hessian matrix, we should use only the objective function and gradient values. Thus, if we can find a suitable method of choosing $\boldsymbol{H}_k$, the iteration may be carried out without any evaluation of the Hessian and without the solution of any set of linear equations.

Approximating the Inverse Hessian

Let $\boldsymbol{H}_0, \boldsymbol{H}_1, \boldsymbol{H}_2, \ldots$ be successive approximations of the inverse Hessian $\boldsymbol{F}(\boldsymbol{x}^{(k)})^{-1}$. We now derive a condition that these approximations should satisfy, which forms the starting point for our subsequent discussion of quasi-Newton algorithms. To begin, suppose first that the Hessian matrix $\boldsymbol{F}(\boldsymbol{x})$ of the objective function $f$ is constant and independent of $\boldsymbol{x}$. In other words, the objective function is quadratic, with Hessian $\boldsymbol{F}(\boldsymbol{x})=\boldsymbol{Q}$ for all $\boldsymbol{x}$, where $\boldsymbol{Q}=\boldsymbol{Q}^{\top}$. Then,

$$\boldsymbol{g}^{(k+1)}-\boldsymbol{g}^{(k)}=\boldsymbol{Q}\left(\boldsymbol{x}^{(k+1)}-\boldsymbol{x}^{(k)}\right) .$$

Let

$$\Delta \boldsymbol{g}^{(k)} \triangleq \boldsymbol{g}^{(k+1)}-\boldsymbol{g}^{(k)}$$

and

$$\Delta \boldsymbol{x}^{(k)} \triangleq \boldsymbol{x}^{(k+1)}-\boldsymbol{x}^{(k)} ,$$

so that the relation above reads $\boldsymbol{Q} \Delta \boldsymbol{x}^{(k)}=\Delta \boldsymbol{g}^{(k)}$.

We start with a real symmetric positive definite matrix $\boldsymbol{H}_0$. Note that given $k$, the matrix $\boldsymbol{Q}^{-1}$ satisfies

$$\boldsymbol{Q}^{-1} \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(i)}, \quad 0 \leq i \leq k .$$

Therefore, we also impose the requirement that the approximation $\boldsymbol{H}_{k+1}$ of the inverse Hessian satisfy

$$\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(i)}, \quad 0 \leq i \leq k .$$

If $n$ steps are involved, then moving in the $n$ directions $\Delta \boldsymbol{x}^{(0)}, \Delta \boldsymbol{x}^{(1)}, \ldots, \Delta \boldsymbol{x}^{(n-1)}$ yields

$$
\begin{aligned}
\boldsymbol{H}_n \Delta \boldsymbol{g}^{(0)} &=\Delta \boldsymbol{x}^{(0)}, \\
\boldsymbol{H}_n \Delta \boldsymbol{g}^{(1)} &=\Delta \boldsymbol{x}^{(1)}, \\
& \vdots \\
\boldsymbol{H}_n \Delta \boldsymbol{g}^{(n-1)} &=\Delta \boldsymbol{x}^{(n-1)} .
\end{aligned}
$$

This set of equations can be represented as

$$\boldsymbol{H}_n\left[\Delta \boldsymbol{g}^{(0)}, \Delta \boldsymbol{g}^{(1)}, \ldots, \Delta \boldsymbol{g}^{(n-1)}\right]=\left[\Delta \boldsymbol{x}^{(0)}, \Delta \boldsymbol{x}^{(1)}, \ldots, \Delta \boldsymbol{x}^{(n-1)}\right] .$$

Note that $\boldsymbol{Q}$ satisfies

$$\boldsymbol{Q}\left[\Delta \boldsymbol{x}^{(0)}, \Delta \boldsymbol{x}^{(1)}, \ldots, \Delta \boldsymbol{x}^{(n-1)}\right]=\left[\Delta \boldsymbol{g}^{(0)}, \Delta \boldsymbol{g}^{(1)}, \ldots, \Delta \boldsymbol{g}^{(n-1)}\right]$$

and

$$\boldsymbol{Q}^{-1}\left[\Delta \boldsymbol{g}^{(0)}, \Delta \boldsymbol{g}^{(1)}, \ldots, \Delta \boldsymbol{g}^{(n-1)}\right]=\left[\Delta \boldsymbol{x}^{(0)}, \Delta \boldsymbol{x}^{(1)}, \ldots, \Delta \boldsymbol{x}^{(n-1)}\right] .$$

Therefore, if $\left[\Delta \boldsymbol{g}^{(0)}, \Delta \boldsymbol{g}^{(1)}, \ldots, \Delta \boldsymbol{g}^{(n-1)}\right]$ is nonsingular, then $\boldsymbol{Q}^{-1}$ is determined uniquely after $n$ steps, via

$$\boldsymbol{Q}^{-1}=\boldsymbol{H}_n=\left[\Delta \boldsymbol{x}^{(0)}, \Delta \boldsymbol{x}^{(1)}, \ldots, \Delta \boldsymbol{x}^{(n-1)}\right]\left[\Delta \boldsymbol{g}^{(0)}, \Delta \boldsymbol{g}^{(1)}, \ldots, \Delta \boldsymbol{g}^{(n-1)}\right]^{-1} .$$

As a consequence, we conclude that if $\boldsymbol{H}_n$ satisfies the equations $\boldsymbol{H}_n \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(i)}$, $0 \leq i \leq n-1$, then the algorithm $\boldsymbol{x}^{(k+1)}=\boldsymbol{x}^{(k)}-\alpha_k \boldsymbol{H}_k \boldsymbol{g}^{(k)}$, $\alpha_k=\arg \min_{\alpha \geq 0} f(\boldsymbol{x}^{(k)}-\alpha \boldsymbol{H}_k \boldsymbol{g}^{(k)})$, is guaranteed to solve problems with quadratic objective functions in $n+1$ steps, because the update $\boldsymbol{x}^{(n+1)}=\boldsymbol{x}^{(n)}-\alpha_n \boldsymbol{H}_n \boldsymbol{g}^{(n)}$ is equivalent to Newton's algorithm. In fact, as we shall see below, such algorithms solve quadratic problems of $n$ variables in at most $n$ steps.
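The identity $\boldsymbol{Q}^{-1}=\boldsymbol{H}_n$ is easy to verify numerically. The sketch below does not run an actual descent algorithm; as an illustrative shortcut, it takes random linearly independent steps $\Delta\boldsymbol{x}^{(i)}$, uses $\Delta\boldsymbol{g}^{(i)}=\boldsymbol{Q}\Delta\boldsymbol{x}^{(i)}$ (exact for a quadratic), and checks that stacking the steps as columns recovers the inverse Hessian.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)        # symmetric positive definite Hessian

# Any n linearly independent steps will do (random here, an illustrative
# assumption); for a quadratic, each gradient change is Δg = Q Δx.
dX = rng.standard_normal((n, n))   # columns: Δx^(0), ..., Δx^(n-1)
dG = Q @ dX                        # columns: Δg^(0), ..., Δg^(n-1)

# H_n = [Δx^(0), ..., Δx^(n-1)] [Δg^(0), ..., Δg^(n-1)]^{-1}
H_n = dX @ np.linalg.inv(dG)
assert np.allclose(H_n, np.linalg.inv(Q))
```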

The considerations above illustrate the basic idea behind the quasi-Newton methods. Specifically, quasi-Newton algorithms have the form

$$
\begin{aligned}
\boldsymbol{d}^{(k)} &=-\boldsymbol{H}_k \boldsymbol{g}^{(k)} , \\
\alpha_k &=\underset{\alpha \geq 0}{\arg \min } f\left(\boldsymbol{x}^{(k)}+\alpha \boldsymbol{d}^{(k)}\right) , \\
\boldsymbol{x}^{(k+1)} &=\boldsymbol{x}^{(k)}+\alpha_k \boldsymbol{d}^{(k)} ,
\end{aligned}
$$

where the matrices $\boldsymbol{H}_0, \boldsymbol{H}_1, \ldots$ are symmetric. In the quadratic case these matrices are required to satisfy

$$\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(i)}, \quad 0 \leq i \leq k,$$

where $\Delta \boldsymbol{x}^{(i)}=\boldsymbol{x}^{(i+1)}-\boldsymbol{x}^{(i)}=\alpha_i \boldsymbol{d}^{(i)}$ and $\Delta \boldsymbol{g}^{(i)}=\boldsymbol{g}^{(i+1)}-\boldsymbol{g}^{(i)}=\boldsymbol{Q} \Delta \boldsymbol{x}^{(i)}$. It turns out that quasi-Newton methods are also conjugate direction methods, as stated in the following theorem.
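The quasi-Newton form above can be sketched as a template parameterized by the update rule for $\boldsymbol{H}_k$. This is a minimal sketch under stated assumptions: the function name is mine, and for a quadratic the exact line search has the closed form $\alpha_k=-\boldsymbol{g}^{(k)\top}\boldsymbol{d}^{(k)}/\boldsymbol{d}^{(k)\top}\boldsymbol{Q}\boldsymbol{d}^{(k)}$, which is used in place of a general one-dimensional search.

```python
import numpy as np

def quasi_newton_quadratic(Q, b, x0, update, tol=1e-10, max_iter=200):
    """Generic quasi-Newton iteration for f(x) = 0.5 x^T Q x - b^T x.

    `update(H, dx, dg)` returns H_{k+1} from H_k and the pair
    (Δx^(k), Δg^(k)); the exact line search for a quadratic is
    alpha_k = -g^T d / (d^T Q d).
    """
    x = np.asarray(x0, dtype=float)
    H = np.eye(len(x))                  # H_0: symmetric positive definite
    g = Q @ x - b                       # gradient of the quadratic
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -H @ g                      # search direction d = -H g
        alpha = -(g @ d) / (d @ Q @ d)  # exact line search
        x_new = x + alpha * d
        g_new = Q @ x_new - b
        H = update(H, x_new - x, g_new - g)
        x, g = x_new, g_new
    return x

# With the trivial update H_{k+1} = H_k = I, this reduces to steepest descent.
Q = np.diag([1.0, 2.0, 3.0])
b = np.ones(3)
x_star = quasi_newton_quadratic(Q, b, np.zeros(3), lambda H, dx, dg: H)
```

The rank one and DFP methods discussed below are obtained by plugging in their respective update rules for `update`.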

Theorem

Consider a quasi-Newton algorithm applied to a quadratic function with Hessian $\boldsymbol{Q}=\boldsymbol{Q}^{\top}$ such that for $0 \leq k<n-1$,

$$\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(i)}, \quad 0 \leq i \leq k,$$

where $\boldsymbol{H}_{k+1}=\boldsymbol{H}_{k+1}^{\top}$. If $\alpha_i \neq 0$, $0 \leq i \leq k$, then $\boldsymbol{d}^{(0)}, \ldots, \boldsymbol{d}^{(k+1)}$ are $\boldsymbol{Q}$-conjugate.

Proof

We proceed by induction. We begin with the $k=0$ case: that $\boldsymbol{d}^{(0)}$ and $\boldsymbol{d}^{(1)}$ are $\boldsymbol{Q}$-conjugate. Because $\alpha_0 \neq 0$, we can write $\boldsymbol{d}^{(0)}=\Delta \boldsymbol{x}^{(0)} / \alpha_0$. Hence,

$$
\begin{aligned}
\boldsymbol{d}^{(1) \top} \boldsymbol{Q} \boldsymbol{d}^{(0)} &=-\boldsymbol{g}^{(1) \top} \boldsymbol{H}_1 \boldsymbol{Q} \boldsymbol{d}^{(0)} \\
&=-\boldsymbol{g}^{(1) \top} \boldsymbol{H}_1 \frac{\boldsymbol{Q} \Delta \boldsymbol{x}^{(0)}}{\alpha_0} \\
&=-\boldsymbol{g}^{(1) \top} \frac{\boldsymbol{H}_1 \Delta \boldsymbol{g}^{(0)}}{\alpha_0} \\
&=-\boldsymbol{g}^{(1) \top} \frac{\Delta \boldsymbol{x}^{(0)}}{\alpha_0} \\
&=-\boldsymbol{g}^{(1) \top} \boldsymbol{d}^{(0)} .
\end{aligned}
$$

But $\boldsymbol{g}^{(1) \top} \boldsymbol{d}^{(0)}=0$ as a consequence of $\alpha_0>0$ being the minimizer of $\phi(\alpha)=f(\boldsymbol{x}^{(0)}+\alpha \boldsymbol{d}^{(0)})$ (see Exercise 11.1). Hence, $\boldsymbol{d}^{(1) \top} \boldsymbol{Q} \boldsymbol{d}^{(0)}=0$.

Assume that the result is true for $k-1$ (where $k<n-1$). We now prove the result for $k$; that is, that $\boldsymbol{d}^{(0)}, \ldots, \boldsymbol{d}^{(k+1)}$ are $\boldsymbol{Q}$-conjugate. It suffices to show that $\boldsymbol{d}^{(k+1) \top} \boldsymbol{Q} \boldsymbol{d}^{(i)}=0$, $0 \leq i \leq k$. Given $i$, $0 \leq i \leq k$, using the same algebraic steps as in the $k=0$ case, and using the assumption that $\alpha_i \neq 0$, we obtain

$$
\begin{aligned}
\boldsymbol{d}^{(k+1) \top} \boldsymbol{Q} \boldsymbol{d}^{(i)} &=-\boldsymbol{g}^{(k+1) \top} \boldsymbol{H}_{k+1} \boldsymbol{Q} \boldsymbol{d}^{(i)} \\
& \vdots \\
&=-\boldsymbol{g}^{(k+1) \top} \boldsymbol{d}^{(i)} .
\end{aligned}
$$

Because $\boldsymbol{d}^{(0)}, \ldots, \boldsymbol{d}^{(k)}$ are $\boldsymbol{Q}$-conjugate by assumption, we conclude from Lemma 10.2 that $\boldsymbol{g}^{(k+1) \top} \boldsymbol{d}^{(i)}=0$. Hence, $\boldsymbol{d}^{(k+1) \top} \boldsymbol{Q} \boldsymbol{d}^{(i)}=0$, which completes the proof.

By this theorem, we conclude that a quasi-Newton algorithm solves a quadratic function of $n$ variables in at most $n$ steps.

Note that the equations that the matrices $\boldsymbol{H}_k$ are required to satisfy do not determine those matrices uniquely. Thus, we have some freedom in the way we compute the $\boldsymbol{H}_k$. In the methods we describe, we compute $\boldsymbol{H}_{k+1}$ by adding a correction to $\boldsymbol{H}_k$. In the following sections we consider three specific updating formulas.

The Rank One Correction Formula

In the rank one correction formula, the correction term is symmetric and has the form $a_k \boldsymbol{z}^{(k)} \boldsymbol{z}^{(k) \top}$, where $a_k \in \mathbb{R}$ and $\boldsymbol{z}^{(k)} \in \mathbb{R}^n$. Therefore, the update equation is

$$\boldsymbol{H}_{k+1}=\boldsymbol{H}_k+a_k \boldsymbol{z}^{(k)} \boldsymbol{z}^{(k) \top} .$$

Note that

$$\operatorname{rank} \boldsymbol{z}^{(k)} \boldsymbol{z}^{(k) \top}=\operatorname{rank}\left(\left[\begin{array}{c} z_1^{(k)} \\ \vdots \\ z_n^{(k)} \end{array}\right]\left[\begin{array}{lll} z_1^{(k)} & \cdots & z_n^{(k)} \end{array}\right]\right)=1 ,$$

and hence the name rank one correction [it is also called the single-rank symmetric (SRS) algorithm]. The product $\boldsymbol{z}^{(k)} \boldsymbol{z}^{(k) \top}$ is sometimes referred to as the dyadic product or outer product. Observe that if $\boldsymbol{H}_k$ is symmetric, then so is $\boldsymbol{H}_{k+1}$.

Our goal now is to determine $a_k$ and $\boldsymbol{z}^{(k)}$, given $\boldsymbol{H}_k$, $\Delta \boldsymbol{g}^{(k)}$, and $\Delta \boldsymbol{x}^{(k)}$, so that the required relationship discussed before is satisfied; namely, $\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(i)}$, $i=0,1, \ldots, k$. To begin, let us first consider the condition $\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(k)}=\Delta \boldsymbol{x}^{(k)}$. In other words, given $\boldsymbol{H}_k$, $\Delta \boldsymbol{g}^{(k)}$, and $\Delta \boldsymbol{x}^{(k)}$, we wish to find $a_k$ and $\boldsymbol{z}^{(k)}$ to ensure that

$$\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(k)}=\left(\boldsymbol{H}_k+a_k \boldsymbol{z}^{(k)} \boldsymbol{z}^{(k) \top}\right) \Delta \boldsymbol{g}^{(k)}=\Delta \boldsymbol{x}^{(k)} .$$

First note that $\boldsymbol{z}^{(k) \top} \Delta \boldsymbol{g}^{(k)}$ is a scalar. Thus,

$$\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}=\left(a_k \boldsymbol{z}^{(k) \top} \Delta \boldsymbol{g}^{(k)}\right) \boldsymbol{z}^{(k)},$$

and hence

$$\boldsymbol{z}^{(k)}=\frac{\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}}{a_k\left(\boldsymbol{z}^{(k) \top} \Delta \boldsymbol{g}^{(k)}\right)} .$$

We can now determine the correction term:

$$a_k \boldsymbol{z}^{(k)} \boldsymbol{z}^{(k) \top}=\frac{\left(\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}\right)\left(\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}\right)^{\top}}{a_k\left(\boldsymbol{z}^{(k) \top} \Delta \boldsymbol{g}^{(k)}\right)^2} .$$

Hence,

$$\boldsymbol{H}_{k+1}=\boldsymbol{H}_k+\frac{\left(\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}\right)\left(\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}\right)^{\top}}{a_k\left(\boldsymbol{z}^{(k) \top} \Delta \boldsymbol{g}^{(k)}\right)^2} .$$

The next step is to express the denominator of the second term on the right-hand side of the equation above as a function of the given quantities $\boldsymbol{H}_k$, $\Delta \boldsymbol{g}^{(k)}$, and $\Delta \boldsymbol{x}^{(k)}$. To accomplish this, premultiply $\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}=\left(a_k \boldsymbol{z}^{(k) \top} \Delta \boldsymbol{g}^{(k)}\right) \boldsymbol{z}^{(k)}$ by $\Delta \boldsymbol{g}^{(k) \top}$ to obtain

$$\Delta \boldsymbol{g}^{(k) \top} \Delta \boldsymbol{x}^{(k)}-\Delta \boldsymbol{g}^{(k) \top} \boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}=\Delta \boldsymbol{g}^{(k) \top} a_k \boldsymbol{z}^{(k)} \boldsymbol{z}^{(k) \top} \Delta \boldsymbol{g}^{(k)} .$$

Observe that $a_k$ is a scalar, and so is $\Delta \boldsymbol{g}^{(k) \top} \boldsymbol{z}^{(k)}=\boldsymbol{z}^{(k) \top} \Delta \boldsymbol{g}^{(k)}$. Thus,

$$\Delta \boldsymbol{g}^{(k) \top} \Delta \boldsymbol{x}^{(k)}-\Delta \boldsymbol{g}^{(k) \top} \boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}=a_k\left(\boldsymbol{z}^{(k) \top} \Delta \boldsymbol{g}^{(k)}\right)^2 .$$

Taking this relation into account yields

$$\boldsymbol{H}_{k+1}=\boldsymbol{H}_k+\frac{\left(\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}\right)\left(\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}\right)^{\top}}{\Delta \boldsymbol{g}^{(k) \top}\left(\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}\right)} .$$
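A quick numerical check confirms that this update satisfies $\boldsymbol{H}_{k+1}\Delta\boldsymbol{g}^{(k)}=\Delta\boldsymbol{x}^{(k)}$ by construction. The vectors below are random rather than produced by an actual iteration, an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
Hk = np.eye(n)                         # current inverse-Hessian approximation
dx = rng.standard_normal(n)            # stand-in for Δx^(k)
dg = rng.standard_normal(n)            # stand-in for Δg^(k)

u = dx - Hk @ dg                       # Δx^(k) - H_k Δg^(k)
H_next = Hk + np.outer(u, u) / (dg @ u)

# The rank one update satisfies H_{k+1} Δg^(k) = Δx^(k).
assert np.allclose(H_next @ dg, dx)
```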

We summarize the above development in the following algorithm.

Rank One Algorithm

  1. Set $k:=0$; select $\boldsymbol{x}^{(0)}$ and a real symmetric positive definite $\boldsymbol{H}_0$.
  2. If $\boldsymbol{g}^{(k)}=\mathbf{0}$, stop; else, $\boldsymbol{d}^{(k)}=-\boldsymbol{H}_k \boldsymbol{g}^{(k)}$.
  3. Compute
    $$
    \begin{aligned}
    \alpha_k &=\underset{\alpha \geq 0}{\arg \min } f\left(\boldsymbol{x}^{(k)}+\alpha \boldsymbol{d}^{(k)}\right)
    =-\frac{\boldsymbol{g}^{(k) \top} \boldsymbol{d}^{(k)}}{\boldsymbol{d}^{(k) \top} \boldsymbol{Q} \boldsymbol{d}^{(k)}} \quad \text{when } f \text{ is quadratic,} \\
    \boldsymbol{x}^{(k+1)} &=\boldsymbol{x}^{(k)}+\alpha_k \boldsymbol{d}^{(k)} .
    \end{aligned}
    $$
  4. Compute
    $$
    \begin{aligned}
    \Delta \boldsymbol{x}^{(k)} &=\alpha_k \boldsymbol{d}^{(k)} , \\
    \Delta \boldsymbol{g}^{(k)} &=\boldsymbol{g}^{(k+1)}-\boldsymbol{g}^{(k)} , \\
    \boldsymbol{H}_{k+1} &=\boldsymbol{H}_k+\frac{\left(\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}\right)\left(\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}\right)^{\top}}{\Delta \boldsymbol{g}^{(k) \top}\left(\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}\right)} .
    \end{aligned}
    $$
  5. Set $k:=k+1$; go to step 2.

The rank one algorithm is based on satisfying the equation

$$\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(k)}=\Delta \boldsymbol{x}^{(k)} .$$

However, what we want is

$$\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(i)}, \quad i=0,1, \ldots, k .$$

It turns out that this stronger condition is, in fact, satisfied automatically, as stated in the following theorem.
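The rank one algorithm can be sketched for a quadratic objective as follows. This is a minimal illustration, not the definitive implementation: the function name is mine, the exact line search uses the closed form for quadratics, and the guard on a small denominator is my addition (the denominator $\Delta\boldsymbol{g}^{(k)\top}(\Delta\boldsymbol{x}^{(k)}-\boldsymbol{H}_k\Delta\boldsymbol{g}^{(k)})$ can vanish, a known practical difficulty with this update).

```python
import numpy as np

def rank_one(Q, b, x0, max_iter=None):
    """Rank one (SRS) algorithm for f(x) = 0.5 x^T Q x - b^T x,
    using the exact line search alpha = -g^T d / (d^T Q d)."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    H = np.eye(n)                       # H_0: symmetric positive definite
    g = Q @ x - b
    for _ in range(max_iter or n + 1):
        if np.linalg.norm(g) < 1e-12:   # step 2: stop when g^(k) = 0
            break
        d = -H @ g
        alpha = -(g @ d) / (d @ Q @ d)  # step 3: exact line search
        x_new = x + alpha * d
        g_new = Q @ x_new - b
        dx, dg = x_new - x, g_new - g
        u = dx - H @ dg                 # step 4: rank one correction
        denom = dg @ u
        if abs(denom) > 1e-12:          # guard (my addition): skip if tiny
            H = H + np.outer(u, u) / denom
        x, g = x_new, g_new
    return x

Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_star = rank_one(Q, b, np.zeros(2))   # converges to the solution of Qx = b
```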

Theorem

For the rank one algorithm applied to the quadratic with Hessian $\boldsymbol{Q}=\boldsymbol{Q}^{\top}$, we have $\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(i)}$, $0 \leq i \leq k$.

Proof

We prove the result by induction. From the discussion before the theorem, it is clear that the claim is true for $k=0$. Suppose now that the theorem is true for $k-1 \geq 0$; that is, $\boldsymbol{H}_k \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(i)}$, $i<k$. We now show that the theorem is true for $k$. Our construction of the correction term ensures that

$$\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(k)}=\Delta \boldsymbol{x}^{(k)} .$$

So we only have to show that

$$\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(i)}, \quad i<k .$$

To this end, fix $i<k$. We have

$$\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(i)}=\boldsymbol{H}_k \Delta \boldsymbol{g}^{(i)}+\frac{\left(\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}\right)\left(\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}\right)^{\top}}{\Delta \boldsymbol{g}^{(k) \top}\left(\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}\right)} \Delta \boldsymbol{g}^{(i)} .$$

By the induction hypothesis, $\boldsymbol{H}_k \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(i)}$. To complete the proof, it is enough to show that the second term on the right-hand side of the equation above is equal to zero. For this to be true it is enough that

$$\left(\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}\right)^{\top} \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(k) \top} \Delta \boldsymbol{g}^{(i)}-\Delta \boldsymbol{g}^{(k) \top} \boldsymbol{H}_k \Delta \boldsymbol{g}^{(i)}=0 .$$

Indeed, since

$$\Delta \boldsymbol{g}^{(k) \top} \boldsymbol{H}_k \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{g}^{(k) \top}\left(\boldsymbol{H}_k \Delta \boldsymbol{g}^{(i)}\right)=\Delta \boldsymbol{g}^{(k) \top} \Delta \boldsymbol{x}^{(i)}$$

by the induction hypothesis, and because $\Delta \boldsymbol{g}^{(k)}=\boldsymbol{Q} \Delta \boldsymbol{x}^{(k)}$, we have

$$\Delta \boldsymbol{g}^{(k) \top} \boldsymbol{H}_k \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{g}^{(k) \top} \Delta \boldsymbol{x}^{(i)}=\Delta \boldsymbol{x}^{(k) \top} \boldsymbol{Q} \Delta \boldsymbol{x}^{(i)}=\Delta \boldsymbol{x}^{(k) \top} \Delta \boldsymbol{g}^{(i)} .$$

Hence,

$$\left(\Delta \boldsymbol{x}^{(k)}-\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}\right)^{\top} \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(k) \top} \Delta \boldsymbol{g}^{(i)}-\Delta \boldsymbol{x}^{(k) \top} \Delta \boldsymbol{g}^{(i)}=0,$$

which completes the proof.

The DFP Algorithm (Rank Two Update)

  1. Set $k:=0$; select $\boldsymbol{x}^{(0)}$ and a real symmetric positive definite $\boldsymbol{H}_0$.
  2. If $\boldsymbol{g}^{(k)}=\mathbf{0}$, stop; else, $\boldsymbol{d}^{(k)}=-\boldsymbol{H}_k \boldsymbol{g}^{(k)}$.
  3. Compute
    $$
    \begin{aligned}
    \alpha_k &=\underset{\alpha \geq 0}{\arg \min } f\left(\boldsymbol{x}^{(k)}+\alpha \boldsymbol{d}^{(k)}\right) , \\
    \boldsymbol{x}^{(k+1)} &=\boldsymbol{x}^{(k)}+\alpha_k \boldsymbol{d}^{(k)} .
    \end{aligned}
    $$
  4. Compute
    $$
    \begin{aligned}
    \Delta \boldsymbol{x}^{(k)} &=\alpha_k \boldsymbol{d}^{(k)} , \\
    \Delta \boldsymbol{g}^{(k)} &=\boldsymbol{g}^{(k+1)}-\boldsymbol{g}^{(k)} , \\
    \boldsymbol{H}_{k+1} &=\boldsymbol{H}_k+\frac{\Delta \boldsymbol{x}^{(k)} \Delta \boldsymbol{x}^{(k) \top}}{\Delta \boldsymbol{x}^{(k) \top} \Delta \boldsymbol{g}^{(k)}}-\frac{\left[\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}\right]\left[\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}\right]^{\top}}{\Delta \boldsymbol{g}^{(k) \top} \boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}} .
    \end{aligned}
    $$
  5. Set $k:=k+1$; go to step 2.

We now show that the DFP algorithm is a quasi-Newton method, in the sense that when applied to quadratic problems we have $\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(i)}$, $0 \leq i \leq k$.

Theorem

In the DFP algorithm applied to the quadratic with Hessian $\boldsymbol{Q}=\boldsymbol{Q}^{\top}$, we have $\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(i)}$, $0 \leq i \leq k$.

Proof

We use induction. For $k=0$, we have

$$
\begin{aligned}
\boldsymbol{H}_1 \Delta \boldsymbol{g}^{(0)} &=\boldsymbol{H}_0 \Delta \boldsymbol{g}^{(0)}+\frac{\Delta \boldsymbol{x}^{(0)} \Delta \boldsymbol{x}^{(0) \top}}{\Delta \boldsymbol{x}^{(0) \top} \Delta \boldsymbol{g}^{(0)}} \Delta \boldsymbol{g}^{(0)}-\frac{\boldsymbol{H}_0 \Delta \boldsymbol{g}^{(0)} \Delta \boldsymbol{g}^{(0) \top} \boldsymbol{H}_0}{\Delta \boldsymbol{g}^{(0) \top} \boldsymbol{H}_0 \Delta \boldsymbol{g}^{(0)}} \Delta \boldsymbol{g}^{(0)} \\
&=\Delta \boldsymbol{x}^{(0)} .
\end{aligned}
$$

Assume that the result is true for $k-1$; that is, $\boldsymbol{H}_k \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(i)}$, $0 \leq i \leq k-1$. We now show that $\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(i)}$, $0 \leq i \leq k$. First, consider $i=k$. We have

$$
\begin{aligned}
\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(k)} &=\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}+\frac{\Delta \boldsymbol{x}^{(k)} \Delta \boldsymbol{x}^{(k) \top}}{\Delta \boldsymbol{x}^{(k) \top} \Delta \boldsymbol{g}^{(k)}} \Delta \boldsymbol{g}^{(k)}-\frac{\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)} \Delta \boldsymbol{g}^{(k) \top} \boldsymbol{H}_k}{\Delta \boldsymbol{g}^{(k) \top} \boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}} \Delta \boldsymbol{g}^{(k)} \\
&=\Delta \boldsymbol{x}^{(k)} .
\end{aligned}
$$

It remains to consider the case $i<k$. To this end,

$$
\begin{aligned}
\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(i)}=& \boldsymbol{H}_k \Delta \boldsymbol{g}^{(i)}+\frac{\Delta \boldsymbol{x}^{(k)} \Delta \boldsymbol{x}^{(k) \top}}{\Delta \boldsymbol{x}^{(k) \top} \Delta \boldsymbol{g}^{(k)}} \Delta \boldsymbol{g}^{(i)}-\frac{\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)} \Delta \boldsymbol{g}^{(k) \top} \boldsymbol{H}_k}{\Delta \boldsymbol{g}^{(k) \top} \boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}} \Delta \boldsymbol{g}^{(i)} \\
=& \Delta \boldsymbol{x}^{(i)}+\frac{\Delta \boldsymbol{x}^{(k)}}{\Delta \boldsymbol{x}^{(k) \top} \Delta \boldsymbol{g}^{(k)}}\left(\Delta \boldsymbol{x}^{(k) \top} \Delta \boldsymbol{g}^{(i)}\right) \\
&-\frac{\boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}}{\Delta \boldsymbol{g}^{(k) \top} \boldsymbol{H}_k \Delta \boldsymbol{g}^{(k)}}\left(\Delta \boldsymbol{g}^{(k) \top} \Delta \boldsymbol{x}^{(i)}\right) .
\end{aligned}
$$

Now,

$$
\begin{aligned}
\Delta \boldsymbol{x}^{(k) \top} \Delta \boldsymbol{g}^{(i)} &=\Delta \boldsymbol{x}^{(k) \top} \boldsymbol{Q} \Delta \boldsymbol{x}^{(i)} \\
&=\alpha_k \alpha_i \boldsymbol{d}^{(k) \top} \boldsymbol{Q} \boldsymbol{d}^{(i)} \\
&=0,
\end{aligned}
$$

by the induction hypothesis and Theorem 11.1. The same arguments yield $\Delta \boldsymbol{g}^{(k) \top} \Delta \boldsymbol{x}^{(i)}=0$. Hence,

$$\boldsymbol{H}_{k+1} \Delta \boldsymbol{g}^{(i)}=\Delta \boldsymbol{x}^{(i)},$$

which completes the proof.

We conclude that the DFP algorithm is a conjugate direction algorithm.
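As a closing illustration, the DFP algorithm can be sketched for a quadratic objective as follows. This is a minimal sketch under stated assumptions: the function name is mine, and the exact line search uses the closed form $\alpha_k=-\boldsymbol{g}^{(k)\top}\boldsymbol{d}^{(k)}/\boldsymbol{d}^{(k)\top}\boldsymbol{Q}\boldsymbol{d}^{(k)}$ valid for quadratics. By the conjugate direction property, the iteration should reach the minimizer in at most $n$ steps.

```python
import numpy as np

def dfp(Q, b, x0):
    """DFP algorithm for f(x) = 0.5 x^T Q x - b^T x, with the exact
    line search alpha = -g^T d / (d^T Q d) for quadratics."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    H = np.eye(n)                       # H_0: symmetric positive definite
    g = Q @ x - b
    for _ in range(n):                  # at most n steps on a quadratic
        if np.linalg.norm(g) < 1e-12:   # step 2: stop when g^(k) = 0
            break
        d = -H @ g
        alpha = -(g @ d) / (d @ Q @ d)  # step 3: exact line search
        x_new = x + alpha * d
        g_new = Q @ x_new - b
        dx, dg = x_new - x, g_new - g
        Hdg = H @ dg                    # step 4: rank two (DFP) update
        H = H + np.outer(dx, dx) / (dx @ dg) - np.outer(Hdg, Hdg) / (dg @ Hdg)
        x, g = x_new, g_new
    return x

# Illustrative 3-variable quadratic with symmetric positive definite Q.
Q = np.array([[5.0, 2.0, 1.0],
              [2.0, 4.0, 0.0],
              [1.0, 0.0, 3.0]])
b = np.array([1.0, -1.0, 2.0])
x_star = dfp(Q, b, np.zeros(3))        # solves Qx = b in at most 3 steps
```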